Support MERGE on cloned table in Delta Lake #24756

Open
wants to merge 1 commit into
base: master

chenjian2664
Contributor

@chenjian2664 chenjian2664 commented Jan 21, 2025

Description

Fix UPDATE failing on a cloned table. Steps to reproduce:

testing/bin/ptl env up --environment singlenode-delta-lake-oss

In Trino: create schema delta.tiny with (location='s3://test-bucket/tiny/');

In Spark-sql: CREATE TABLE tiny.t1 (id int, v string, part date) USING DELTA PARTITIONED BY (part);

In Trino: insert into delta.tiny.t1 values (1, 'A', TIMESTAMP '2024-01-01'), (2, 'B', TIMESTAMP '2024-01-01'), (3, 'C', TIMESTAMP '2024-02-02'), (4, 'D', TIMESTAMP '2024-02-02');

In Spark-sql: CREATE TABLE tiny.t1clone SHALLOW CLONE tiny.t1;

In Trino: update delta.tiny.t1clone set v = 'update1' where id in (1,3); It fails with:

Query 20240904_130833_00010_r4mc9 failed: path [s3://test-bucket/tiny/t1/part=2024-02-02/20240904_125421_00003_r4mc9_8e16bc1c-e33d-45da-b085-427d66e55960] must be a subdirectory of basePath [s3://test-bucket/tiny/t1clone]
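To illustrate why the error above appears: a shallow clone keeps referencing data files under the source table's location, so any check that requires every file path to sit under the clone's own base path will reject them. The sketch below only mimics that kind of check; `relative_to_base` and the file name `data-file` are hypothetical and this is not Trino's actual implementation.

```python
def relative_to_base(base_path: str, path: str) -> str:
    """Return `path` relative to `base_path`, or raise if it lies outside it.

    Hypothetical helper that mimics a "must be a subdirectory" check.
    """
    # Compare against the base with a trailing slash so that
    # ".../t1" is not treated as a prefix of ".../t1clone".
    base = base_path.rstrip("/") + "/"
    if not path.startswith(base):
        raise ValueError(
            f"path [{path}] must be a subdirectory of basePath [{base_path}]"
        )
    return path[len(base):]


source_base = "s3://test-bucket/tiny/t1"
clone_base = "s3://test-bucket/tiny/t1clone"
# Hypothetical data file written into the source table's partition directory.
source_file = "s3://test-bucket/tiny/t1/part=2024-02-02/data-file"

# Resolving against the source table works.
print(relative_to_base(source_base, source_file))

# Resolving the same file against the clone's location fails,
# which is the situation the cloned table is in.
try:
    relative_to_base(clone_base, source_file)
except ValueError as e:
    print(e)
```

A clone-aware writer would presumably need to keep such paths absolute rather than forcing them to be relative to the clone's location.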

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Delta Lake
* Support MERGE on cloned table. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jan 21, 2025
@github-actions github-actions bot added the delta-lake Delta Lake connector label Jan 21, 2025
@chenjian2664 chenjian2664 force-pushed the delta_clone branch 6 times, most recently from 37ff93a to 60e9e73 Compare January 21, 2025 11:47
@ebyhr ebyhr requested review from ebyhr, pajaks and vinay-kl January 22, 2025 04:20
@chenjian2664 chenjian2664 force-pushed the delta_clone branch 2 times, most recently from 9bc68fc to 0cb664f Compare January 23, 2025 08:52
@chenjian2664 chenjian2664 force-pushed the delta_clone branch 2 times, most recently from 7ef5f5d to be52ef7 Compare January 26, 2025 07:59
@chenjian2664 chenjian2664 requested a review from pajaks January 27, 2025 07:30
@chenjian2664 chenjian2664 force-pushed the delta_clone branch 3 times, most recently from b583ce9 to e8d8d33 Compare February 4, 2025 14:20
@chenjian2664 chenjian2664 requested a review from pajaks February 4, 2025 14:30
USING DELTA
TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

INSERT INTO clone_merge_deletion_vector_source VALUES
Member

This insert creates a single row in each Parquet file, so the following delete just removes entire files and no deletionVector appears in the transaction log.
It should be something like:

{"add":{"path":"20250206_113723_00025_y7b6p_79b0624a-b032-4e9f-b67b-9c84b9960729","partitionValues":{},"size":511,"modificationTime":1738841844050,"dataChange":true,"stats":"{\"numRecords\":2,\"minValues\":{\"id\":2,\"v\":\"updated\",\"part\":\"2024-01-01\"},\"maxValues\":{\"id\":4,\"v\":\"updated\",\"part\":\"2024-02-02\"},\"nullCount\":{\"id\":0,\"v\":0,\"part\":0}}","tags":{},"deletionVector":{"storageType":"u","pathOrInlineDv":"-z*atcBlDyPB90fEl>c^","offset":1,"sizeInBytes":38,"cardinality":1}}}

Not sure how to force multiple rows into one file, though. @ebyhr, do you know?
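One way to check whether a test table actually exercises deletion vectors is to scan its transaction log entries for `add` actions that carry a `deletionVector` field. The sketch below assumes the log entries are available as JSON lines; the helper name and the shortened sample entries (`file-a`, `file-b`, `file-c`) are made up for illustration.

```python
import json


def adds_with_deletion_vectors(log_lines):
    """Return the paths of `add` entries that carry a deletionVector."""
    paths = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        add = entry.get("add")
        if add and add.get("deletionVector"):
            paths.append(add["path"])
    return paths


# Shortened, hypothetical log entries: only file-b has a deletion vector.
sample_log = [
    '{"add":{"path":"file-a","partitionValues":{},"size":511,"dataChange":true}}',
    '{"add":{"path":"file-b","partitionValues":{},"size":511,"dataChange":true,'
    '"deletionVector":{"storageType":"u","pathOrInlineDv":"abc","offset":1,'
    '"sizeInBytes":38,"cardinality":1}}}',
    '{"remove":{"path":"file-c","dataChange":true}}',
]

print(adds_with_deletion_vectors(sample_log))
```

If the result is empty for the test table, the delete removed whole files and the deletion vector code path was never exercised.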

Contributor Author

@chenjian2664 chenjian2664 Feb 6, 2025

Do you want to see the cloned table read the source table's DVs?
Reading the 'p' type DVs that the cloned table contains is not supported yet.
Even if we supported reading 'p' type DVs, we would still need to change the path in the DV to a relative path, since we load the table from resources and don't know the prefix of the absolute path. That would require extra logic to also allow relative paths in the implementation that reads absolute-path DVs.

Given these constraints, adding the DV tests as product tests seems more natural for now. WDYT?
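For reference, the Delta protocol defines three deletion vector storage types: 'u' (relative path derived from a UUID), 'i' (inline), and 'p' (absolute path). A minimal sketch of telling them apart from a descriptor is shown below; the helper names are hypothetical, and the sample 'p' descriptor uses a made-up path.

```python
# Storage types as defined by the Delta protocol's deletion vector descriptor.
DV_STORAGE_TYPES = {
    "u": "relative path derived from a UUID",
    "i": "inline",
    "p": "absolute path",
}


def describe_dv(descriptor: dict) -> str:
    """Return a human-readable description of a DV descriptor's storage type."""
    storage_type = descriptor["storageType"]
    if storage_type not in DV_STORAGE_TYPES:
        raise ValueError(f"unknown deletion vector storageType: {storage_type!r}")
    return DV_STORAGE_TYPES[storage_type]


def uses_absolute_path(descriptor: dict) -> bool:
    """True for 'p' descriptors, the kind discussed in this thread."""
    return descriptor["storageType"] == "p"


# Hypothetical descriptors for illustration only.
relative_dv = {"storageType": "u", "pathOrInlineDv": "abc", "offset": 1}
absolute_dv = {"storageType": "p", "pathOrInlineDv": "s3://bucket/table/dv.bin"}

print(describe_dv(relative_dv), uses_absolute_path(relative_dv))
print(describe_dv(absolute_dv), uses_absolute_path(absolute_dv))
```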

Contributor Author

@chenjian2664 chenjian2664 Feb 7, 2025

Filed #24946. Once that is finalized, we can add this test by modifying the referenced paths when loading tables.

Member

So a cloned table will always have the 'p' type, I assume? If so, it's also OK for cloned tables with deletion vectors not to be supported. Then either #24946 can be treated as a prerequisite, or we leave this test in its current state and add a product test that shows the Trino failure in that case.

Contributor Author

So far it seems so. I prefer to update the test after reading 'p' type DVs is supported.
